Shanghang Zhang

Can World Models Benefit VLMs for World Dynamics?

Oct 01, 2025

MathSticks: A Benchmark for Visual Symbolic Compositional Reasoning with Matchstick Puzzles

Oct 01, 2025

MLA: A Multisensory Language-Action Model for Multimodal Understanding and Forecasting in Robotic Manipulation

Sep 30, 2025

WoW: Towards a World omniscient World model Through Embodied Interaction

Sep 26, 2025

BEVUDA++: Geometric-aware Unsupervised Domain Adaptation for Multi-View 3D Object Detection

Sep 17, 2025

SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning

Sep 11, 2025

MMG-Vid: Maximizing Marginal Gains at Segment-level and Token-level for Efficient Video LLMs

Aug 28, 2025

4D Visual Pre-training for Robot Learning

Aug 24, 2025

HumanoidVerse: A Versatile Humanoid for Vision-Language Guided Multi-Object Rearrangement

Aug 23, 2025

$NavA^3$: Understanding Any Instruction, Navigating Anywhere, Finding Anything

Aug 06, 2025